Write a short description about the course and add a link
to your GitHub repository here. This is an R Markdown (.Rmd) file so you
should use R Markdown syntax.
The aim of the course is to teach how to use R, RStudio and GitHub as
well as how to do statistical analyses with R.
GitHub repository: https://github.com/phuoc-tn/IODS-project
# This is a so-called "R chunk" where you can write R code.
date()
## [1] "Tue Dec 6 04:20:54 2022"
paste("Today I stepped on", sample(1:100, size = 1, replace = FALSE), "LEGO brick(s).")
## [1] "Today I stepped on 83 LEGO brick(s)."
How are you feeling right now?
Quite tired.
What do you expect to learn?
How/when/why to do certain statistical analyses. Maybe new tricks to
use in R.
Where did you hear about the course?
My doctoral school informed me in an email.
How did it work as a “crash course” on modern R tools and
using RStudio?
Quite intensive.
Which were your favorite topics?
Data parsing, using partial matches to find columns.
Which topics were most difficult?
Transposing data frames.
Some other comments on the book and our new approach of
getting started with R Markdown etc.?
The online book is a great resource! I will definitely bookmark it
for future use.
Describe the work you have done this week and summarize your learning.
The main highlight of this week was learning how to fit, (somewhat) understand, and interpret linear regression models and their results in R. I now have more insight into identifying and validating variables and models to explain observed data. Let’s hope that I will get a chance to utilize these in the future.
date()
## [1] "Tue Dec 6 04:20:54 2022"
Explore the structure and the dimensions of the data and describe the dataset briefly, assuming the reader has no previous knowledge of it.
1) Load tidyverse.
library(tidyverse)
2) Import data from CSV file created in data wrangling.
learning2014 <- read_csv(file = "Data/learning2014.csv", col_names = TRUE)
3) Check structure and dimensions of data.
str(learning2014)
## spc_tbl_ [166 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ gender : chr [1:166] "F" "M" "F" "M" ...
## $ age : num [1:166] 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num [1:166] 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num [1:166] 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num [1:166] 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num [1:166] 2.58 3.17 2.25 2.25 2.83 ...
## $ points : num [1:166] 25 12 24 10 22 21 21 31 24 26 ...
## - attr(*, "spec")=
## .. cols(
## .. gender = col_character(),
## .. age = col_double(),
## .. attitude = col_double(),
## .. deep = col_double(),
## .. stra = col_double(),
## .. surf = col_double(),
## .. points = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
dim(learning2014)
## [1] 166 7
Answer: The data comes from a statistical study on how students’ learning approaches and strategies relate to their achievements in learning statistics in an introductory course taught at the University of Helsinki, Finland. It was sampled from the original dataset of the study and consists of seven variables (gender, age, attitude, deep, stra, surf, and points) measured across 166 observations acquired via a questionnaire. The attitude variable (adjusted by dividing the original attitude scores by ten) describes the students’ attitude toward learning statistics, measured with ten statements. The variables deep, stra, and surf are abbreviations of the deep (aim to engage with, learn, and understand the course material), strategic (aim to maximize points by any means), and surface (aim to pass with minimal effort) learning approaches, respectively, and were measured on a 1–5 scale (1 = disagree, 5 = agree).
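A small sketch of the adjustment described above: the raw attitude score is the sum of ten 1–5 Likert items (so it ranges 10–50), and dividing by ten brings it onto the same 1–5 scale as the other variables. The raw values below are made-up examples, not taken from the data.

```r
# Hypothetical sketch of the attitude adjustment: divide the raw sum
# of ten 1-5 Likert items by ten to get back to a 1-5 scale.
raw_attitude <- c(37, 31, 25)  # example raw sums (10-50 range)
adjusted <- raw_attitude / 10  # same scale as deep, stra, surf
adjusted
```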
Show a graphical overview of the data and show summaries of the variables in the data. Describe and interpret the outputs, commenting on the distributions of the variables and the relationships between them.
1.1) Use pairs() to visualize data.
pairs(learning2014[-1],
col = c("red", "blue")[factor(learning2014$gender)], # Use gender for colors.
oma = c(2, 2, 2, 10)) # Set the margins outside of plot.
par(xpd = TRUE) # Allow legend to be outside of plot area.
legend(0.95, 0.6, # Coordinates for legend position.
legend = unique(learning2014$gender),
fill = c("red", "blue"))
Getting colors and legend with pairs() was harder than expected. Maybe too hard…
1.2) Or use ggplot2 and GGally to visualize data.
library(tidyverse) # ggplot2 is part of tidyverse.
library(GGally)
ggpairs(
learning2014,
mapping = aes(col = gender, alpha = 0.3),
lower = list(combo = wrap("facethist", bins = 20))
) +
scale_color_manual(values = c("red", "blue")) + # Assign new colors manually.
scale_fill_manual(values = c("red", "blue")) + # Same here.
theme_bw() + # Change default theme to something more clean.
theme(strip.text = element_text(size = 14)) # Adjust panel title sizes.
2) Summarize data.
learning2014$gender <- as.factor(learning2014$gender) # Changed gender from character to factor for summary.
summary(learning2014)
## gender age attitude deep stra
## F:110 Min. :17.00 Min. :1.400 Min. :1.583 Min. :1.250
## M: 56 1st Qu.:21.00 1st Qu.:2.600 1st Qu.:3.333 1st Qu.:2.625
## Median :22.00 Median :3.200 Median :3.667 Median :3.188
## Mean :25.51 Mean :3.143 Mean :3.680 Mean :3.121
## 3rd Qu.:27.00 3rd Qu.:3.700 3rd Qu.:4.083 3rd Qu.:3.625
## Max. :55.00 Max. :5.000 Max. :4.917 Max. :5.000
## surf points
## Min. :1.583 Min. : 7.00
## 1st Qu.:2.417 1st Qu.:19.00
## Median :2.833 Median :23.00
## Mean :2.787 Mean :22.72
## 3rd Qu.:3.167 3rd Qu.:27.75
## Max. :4.333 Max. :33.00
Answer: The age of the students ranged from 17 to 55, with most being ca. 20 years old. Based on the results, age does not seem to correlate with the number of points acquired during the statistics course. Although participating female students (n = 110) outnumbered males (n = 56) roughly 2:1, there were minimal differences in points acquired between the two groups. The males, however, scored higher in attitude on average than the females.
A significant positive correlation was found between attitude towards learning statistics and points, indicating that the former is a good predictor of the latter. The individual learning approaches (deep, surface, and strategic) did not correlate strongly with points. Among them, however, the strategic approach correlates most positively with higher points, which aligns with the general aim of the approach: to maximize the number of points by any means. This approach was also, on average, slightly more popular among females than males.
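The correlations discussed above can also be quantified directly with cor(); a quick sketch, assuming learning2014 is loaded as before:

```r
# Correlation between attitude and exam points, discussed above:
cor(learning2014$attitude, learning2014$points)

# Or the full correlation matrix of the numeric variables at once:
cor(learning2014[, c("attitude", "deep", "stra", "surf", "points")])
```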
Choose three variables as explanatory variables and fit a regression model where exam points is the target (dependent, outcome) variable. Show a summary of the fitted model and comment and interpret the results. Explain and interpret the statistical test related to the model parameters. If an explanatory variable in your model does not have a statistically significant relationship with the target variable, remove the variable from the model and fit the model again without it.
1) Fit the learning approaches to linear model.
linear_model <- lm(points ~ deep + stra + surf, data = learning2014)
2) Check summary of linear model.
summary(linear_model)
##
## Call:
## lm(formula = points ~ deep + stra + surf, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.1235 -3.0737 0.5226 4.2799 10.3229
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 26.9143 5.1169 5.260 4.5e-07 ***
## deep -0.7443 0.8662 -0.859 0.3915
## stra 0.9878 0.5962 1.657 0.0994 .
## surf -1.6296 0.9153 -1.780 0.0769 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.827 on 162 degrees of freedom
## Multiple R-squared: 0.04071, Adjusted R-squared: 0.02295
## F-statistic: 2.292 on 3 and 162 DF, p-value: 0.08016
3) Remove all three approaches, as none of them has a statistically significant relationship with exam points, and refit the model as instructed.
new_linear_model <- lm(points ~ 1, data = learning2014)
summary(new_linear_model)
##
## Call:
## lm(formula = points ~ 1, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.7169 -3.7169 0.2831 5.0331 10.2831
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 22.7169 0.4575 49.65 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.895 on 165 degrees of freedom
Answer: In short, a linear regression model estimates the linear relationship between the explanatory variables, in my case the three learning approaches (deep, strategic, and surface), and the predicted outcome, i.e. the exam points. Even a cursory look at the summary table and the p-values in the Coefficients section shows that none of the three approaches is statistically significant, meaning there is no evidence of a linear relationship between them and the exam points.
Using a summary of your fitted model, explain the relationship between the chosen explanatory variables and the target variable (interpret the model parameters). Explain and interpret the multiple R-squared of the model.
Answer: The three learning approaches are not significant explanatory variables for the exam points. This can be seen, for example, in the previously mentioned Coefficients section, which shows the estimated effects of the selected predictors (the learning approaches in this case) on the exam points and their significance. The statistical significance of each predictor is indicated by the symbol next to its p-value, with the corresponding thresholds shown immediately below (Signif. codes). The overall significance of the model, or the lack thereof, can also be seen in the F-statistic line.
The estimates in the Coefficients section suggest (interestingly enough) that only the strategic approach correlates positively with higher exam points. This makes sense assuming students use the deep and/or surface learning approaches depending on how familiar they are with certain topics in statistics, in order to maximize points.
The multiple R² value shows how much of the variance in the outcome the predictors explain in our dataset: the three approaches explain only ca. 4.1% of the variance. The adjusted R² value, as the name suggests, is scaled so that adding more variables does not automatically inflate R². By that measure, only 2.3% of the variance in exam points can be explained by the three approaches.
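The multiple R² reported in the summary can be reproduced by hand as 1 − RSS/TSS, where RSS is the residual sum of squares of the model and TSS the total sum of squares around the mean of the outcome; a sketch using the linear_model fitted above:

```r
# R^2 = 1 - RSS/TSS, computed manually from the fitted model:
rss <- sum(residuals(linear_model)^2)
tss <- sum((learning2014$points - mean(learning2014$points))^2)
1 - rss / tss  # should match the Multiple R-squared of ca. 0.041
```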
Produce the following diagnostic plots: Residuals vs Fitted values, Normal QQ-plot and Residuals vs Leverage. Explain the assumptions of the model and interpret the validity of those assumptions based on the diagnostic plots.
1) Generate requested plots from the linear model with the three learning approaches.
par(mfrow = c(2, 2),
mar = c(2, 2, 2, 2)) # Decrease margins around plots.
plot(linear_model, which = c(1, 2, 5), lwd = 2) # lwd = line width.
Answer: The Residuals vs Fitted plot shows whether the relationship between the fitted values and their residuals is linear. Comparing the red line to the dashed grey line shows that the linearity in my model is not the greatest. The Normal QQ-plot shows whether the residuals of the model follow a normal distribution; here they mostly do, apart from the tails. The Residuals vs Leverage plot shows which data points are “influential”, i.e. which points significantly affect the results of the linear model. Data points outside the Cook’s distance, shown as grey dashed line(s), would be influential. Because no data points in this model fall outside that line, none of them are influential.
date()
## [1] "Tue Dec 6 04:21:07 2022"
The CSV file for this week’s assignment contains modified data from a study by Paulo Cortez and Alice Silva, published in 2008 (source). The table has measurements (obtained via school reports and questionnaires) of variables related to students’ performance in mathematics and Portuguese in secondary education at two Portuguese schools (Gabriel Pereira and Mousinho da Silveira). The following attributes are found in the dataset:
library(tidyverse)
student_alc <- read_csv(file = "https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/alc.csv", col_names = TRUE)
colnames(student_alc)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "guardian" "traveltime" "studytime" "schoolsup"
## [16] "famsup" "activities" "nursery" "higher" "internet"
## [21] "romantic" "famrel" "freetime" "goout" "Dalc"
## [26] "Walc" "health" "failures" "paid" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
The high_use column is an added variable: it is TRUE when alc_use is higher than two and FALSE otherwise.
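A sketch of how the added columns were presumably constructed; the averaging of the workday (Dalc) and weekend (Walc) consumption into alc_use is my assumption based on the course material, not shown in this file:

```r
# Presumed construction of the two added columns (re-creating them here
# would simply reproduce the values already in the data):
student_alc <- student_alc %>%
  mutate(alc_use = (Dalc + Walc) / 2,  # average of workday and weekend use
         high_use = alc_use > 2)       # TRUE when the average exceeds 2
```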
Due to the file containing so many interesting variables to choose from, I decided to let R — and fate — choose the ones I will be using for the rest of the exercises. These were:
set.seed(2135) # Seed number for reproducibility, when using random sampling.
interesting_variables <- sample(colnames(student_alc), size = 4, replace = FALSE)
interesting_variables
## [1] "freetime" "sex" "traveltime" "Pstatus"
Free time after school could be connected to high alcohol usage: the more time you have after school, especially as a student between 15 and 22 years old, the more time you have for a drink (or two). Sex might also be connected, because men typically drink more than women. Travel time, specified as the time from home to school, is harder to link to high alcohol consumption. Maybe Portuguese students drink during long trips to school? Who knows. ¯\_(ツ)_/¯ Lastly, the parents’ cohabitation status (together or apart) might correlate negatively, since children in two-parent households have both parents looking after them and thus fewer opportunities or reasons to drink.
library(RColorBrewer)
library(patchwork)
library(gridExtra)
interesting_variables <- append(interesting_variables, "high_use")
student_alc %>%
select(all_of(interesting_variables)) %>% gather() %>%
ggplot(aes(value, fill = value)) +
geom_bar() +
scale_fill_brewer(palette = "Paired", guide = "none") +
theme_bw() +
theme(strip.text = element_text(size = 11, face = "bold"),
axis.title = element_text(size = 11, face = "bold")) +
facet_wrap("key", scales = "free")
Looking at the bar plots for the interesting variables, I notice that the observations for free time are almost normally distributed. Most of the students do not have high alcohol usage. Interestingly, the parents’ cohabitation status is quite skewed, with most parents living together. The sex distribution is almost equal, and most students have a short travel time to school.
library(ggpubr)
new_student_alc <- student_alc %>%
select(all_of(interesting_variables))
plot_a <-
ggplot(new_student_alc, aes(x = high_use, y = freetime, fill = sex)) +
geom_boxplot() +
scale_fill_brewer(
palette = "Dark2",
name = "Sex",
labels = c("Female", "Male")
) +
scale_y_discrete(limits = c("very low", "low", "medium", "high", "very high")) +
ylab("Free time\n(after school)") +
theme_bw() +
theme(
axis.title.x = element_blank(),
axis.title = element_text(face = "bold"),
legend.title = element_text(face = "bold")
)
plot_b <-
ggplot(new_student_alc, aes(x = high_use, y = traveltime, fill = sex)) +
geom_boxplot() +
scale_fill_brewer(
palette = "Dark2",
name = "Sex",
labels = c("Female", "Male")
) +
scale_y_discrete(limits = c("<15 min", "15–30 min", "30–60 min", ">60 min")) +
ylab("Travel time\n(home to school)") +
theme_bw() +
theme(
axis.title.x = element_blank(),
axis.title = element_text(face = "bold"),
legend.title = element_text(face = "bold")
)
plot_c <-
ggplot(new_student_alc, aes(x = high_use, y = freetime, fill = Pstatus)) +
geom_boxplot() +
scale_fill_brewer(
palette = "Paired",
name = "Parent's\ncohabitation\nstatus",
labels = c("Apart", "Together")
) +
scale_y_discrete(limits = c("very low", "low", "medium", "high", "very high")) +
ylab("Free time\n(after school)") +
theme_bw() +
theme(
axis.title.x = element_blank(),
axis.title = element_text(face = "bold"),
legend.title = element_text(face = "bold")
)
plot_d <-
ggplot(new_student_alc, aes(x = high_use, y = traveltime, fill = Pstatus)) +
geom_boxplot() +
scale_fill_brewer(
palette = "Paired",
name = "Parent's\ncohabitation\nstatus",
labels = c("Apart", "Together")
) +
scale_y_discrete(limits = c("<15 min", "15–30 min", "30–60 min", ">60 min")) +
ylab("Travel time\n(home to school)") +
theme_bw() +
theme(
axis.title.x = element_blank(),
axis.title = element_text(face = "bold"),
legend.title = element_text(face = "bold")
)
joined_plot <- (plot_a + plot_b + plot_layout(guides = "collect")) /
(plot_c + plot_d + plot_layout(guides = "collect")) +
plot_annotation(tag_levels = "A") &
theme(plot.tag = element_text(face = "bold"))
grid.arrange(
patchworkGrob(joined_plot),
bottom = text_grob("High alcohol consumption", face = "bold")
)
Looking at the box plots, there doesn’t seem to be much difference in alcohol consumption by travel time between the sexes or the parents’ cohabitation statuses (panels B and D). With free time (panels A and C), however, there seems to be a small increase in consumption with more free time, which somewhat aligns with my assumptions. The increase also seems to be higher on average among males (panel A), in line with my initial hypotheses. Curiously, I notice that students with separated parents have, on average, less free time but higher alcohol consumption, which contradicts my hypotheses.
model <- glm(high_use ~ freetime + sex + traveltime + Pstatus - 1, data = student_alc, family = "binomial")
summary(model)
##
## Call:
## glm(formula = high_use ~ freetime + sex + traveltime + Pstatus -
## 1, family = "binomial", data = student_alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.3632 -0.8695 -0.6220 1.1984 1.9127
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## freetime 0.3023 0.1253 2.413 0.0158 *
## sexF -2.6603 0.5946 -4.474 7.69e-06 ***
## sexM -1.8937 0.6263 -3.023 0.0025 **
## traveltime 0.4015 0.1617 2.483 0.0130 *
## PstatusT -0.1924 0.3912 -0.492 0.6228
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 512.93 on 370 degrees of freedom
## Residual deviance: 424.31 on 365 degrees of freedom
## AIC: 434.31
##
## Number of Fisher Scoring iterations: 4
The summary shows that Pstatus might not be statistically significant in my model.
OR <- coef(model) %>% exp
CI <- confint(model) %>% exp
## Waiting for profiling to be done...
cbind(OR, CI)
## OR 2.5 % 97.5 %
## freetime 1.35291603 1.06175190 1.7370368
## sexF 0.06992947 0.02089100 0.2168783
## sexM 0.15051787 0.04259748 0.5008024
## traveltime 1.49409723 1.08919695 2.0594122
## PstatusT 0.82496011 0.38967324 1.8280994
Pstatus has the widest confidence interval. Its lower bound is below 1 and its upper bound above 1, meaning there is no statistically significant association with high alcohol usage. This seems to contradict my earlier observations from the box plots. Free time and travel time have statistically significant positive associations (both bounds > 1), which aligns with my assumptions, while sex has a negative one (both bounds < 1), which contradicts them.
library(finalfit)
factor_student_alc <-
new_student_alc %>%
mutate(sex = factor(sex) %>%
fct_recode("Female" = "F",
"Male" = "M") %>%
ff_label("Sex")) %>%
mutate(
Pstatus = factor(Pstatus) %>%
fct_recode("Apart" = "A",
"Together" = "T") %>%
ff_label("Parent's cohabitation status")
) %>%
mutate(
traveltime = factor(traveltime) %>%
fct_recode(
"<15 min" = "1",
"15–30 min" = "2",
"30–60 min" = "3",
">60 min" = "4"
) %>%
ff_label("Travel time (home to school)")
) %>%
mutate(
freetime = factor(freetime) %>%
fct_recode(
"very low" = "1",
"low" = "2",
"medium" = "3",
"high" = "4",
"very high" = "5"
) %>%
ff_label("Free time (after school)")
) %>%
mutate(high_use = ff_label(high_use, "High alcohol consumption"))
factor_student_alc %>%
summary_factorlist(
"high_use",
c("freetime", "sex", "traveltime", "Pstatus"),
p = TRUE,
add_dependent_label = TRUE
)
## Warning in chisq.test(traveltime, high_use): Chi-squared approximation may be
## incorrect
## Dependent: High alcohol consumption FALSE TRUE p
## Free time (after school) very low 15 (5.8) 2 (1.8) 0.032
## low 46 (17.8) 14 (12.6)
## medium 112 (43.2) 40 (36.0)
## high 65 (25.1) 40 (36.0)
## very high 21 (8.1) 15 (13.5)
## Sex Female 154 (59.5) 41 (36.9) <0.001
## Male 105 (40.5) 70 (63.1)
## Travel time (home to school) <15 min 174 (67.2) 68 (61.3) 0.003
## 15–30 min 73 (28.2) 26 (23.4)
## 30–60 min 10 (3.9) 11 (9.9)
## >60 min 2 (0.8) 6 (5.4)
## Parent's cohabitation status Apart 26 (10.0) 12 (10.8) 0.970
## Together 233 (90.0) 99 (89.2)
Using summary_factorlist() from the finalfit package reveals the same.
factor_student_alc %>%
finalfit("high_use",
c("sex", "Pstatus", "traveltime", "freetime"),
metrics = TRUE)
## [[1]]
## Dependent: High alcohol consumption FALSE TRUE
## Sex Female 154 (79.0) 41 (21.0)
## Male 105 (60.0) 70 (40.0)
## Parent's cohabitation status Apart 26 (68.4) 12 (31.6)
## Together 233 (70.2) 99 (29.8)
## Travel time (home to school) <15 min 174 (71.9) 68 (28.1)
## 15–30 min 73 (73.7) 26 (26.3)
## 30–60 min 10 (47.6) 11 (52.4)
## >60 min 2 (25.0) 6 (75.0)
## Free time (after school) very low 15 (88.2) 2 (11.8)
## low 46 (76.7) 14 (23.3)
## medium 112 (73.7) 40 (26.3)
## high 65 (61.9) 40 (38.1)
## very high 21 (58.3) 15 (41.7)
## OR (univariable) OR (multivariable)
## - -
## 0.19 (0.10 to 0.28, p<0.001) 0.15 (0.06 to 0.25, p=0.002)
## - -
## -0.02 (-0.17 to 0.14, p=0.823) -0.06 (-0.21 to 0.09, p=0.464)
## - -
## -0.02 (-0.12 to 0.09, p=0.734) -0.02 (-0.12 to 0.09, p=0.724)
## 0.24 (0.04 to 0.45, p=0.019) 0.25 (0.05 to 0.45, p=0.015)
## 0.47 (0.15 to 0.79, p=0.004) 0.42 (0.10 to 0.73, p=0.010)
## - -
## 0.12 (-0.13 to 0.36, p=0.355) 0.08 (-0.16 to 0.32, p=0.525)
## 0.15 (-0.08 to 0.37, p=0.212) 0.11 (-0.11 to 0.33, p=0.334)
## 0.26 (0.03 to 0.50, p=0.027) 0.22 (-0.01 to 0.45, p=0.066)
## 0.30 (0.04 to 0.56, p=0.026) 0.21 (-0.06 to 0.47, p=0.124)
##
## [[2]]
##
## Number in dataframe = 370, Number in model = 370, Missing = 0, Log-likelihood = -218.3, AIC = 458.6, R-squared = 0.093, Adjusted R-squared = 0.07
factor_student_alc %>%
or_plot(
"high_use",
c("sex", "Pstatus", "traveltime", "freetime"),
title_text_size = 13,
table_text_size = 4,
column_space = c(-3, 3.2, 7.5),
breaks = c(0.5, 1, 2.5, 5, 10, 25, 50)
)
Based on the previous model results, I will exclude Pstatus from further analyses.
model <- glm(high_use ~ sex + freetime + traveltime - 1, data = factor_student_alc, family = "binomial")
probabilities <- predict(model, type = "response")
factor_student_alc <- factor_student_alc %>%
mutate(probability = probabilities,
prediction = probability > 0.5)
table(high_use = factor_student_alc$high_use, prediction = factor_student_alc$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 252 7
## TRUE 97 14
table(high_use = factor_student_alc$high_use,
prediction = factor_student_alc$prediction) %>% prop.table() %>% addmargins()
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.68108108 0.01891892 0.70000000
## TRUE 0.26216216 0.03783784 0.30000000
## Sum 0.94324324 0.05675676 1.00000000
ggplot(factor_student_alc,
aes(x = probability, y = high_use, col = prediction)) +
geom_point() +
theme_bw()
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
loss_func(class = factor_student_alc$high_use, prob = factor_student_alc$probability)
## [1] 0.2810811
The loss function shows that my final model misclassifies, on average, 28.1% of the observations in the training data, which probably isn’t that great.
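The number above is the training error; for an honest estimate of out-of-sample performance, the same loss can be computed with K-fold cross-validation. A sketch using cv.glm from the boot package (assuming it is installed), with the model, data, and loss function defined above:

```r
# 10-fold cross-validation of the final model with the same loss function.
library(boot)
set.seed(2135)  # folds are drawn randomly
cv <- cv.glm(data = factor_student_alc, glmfit = model,
             cost = loss_func, K = 10)
cv$delta[1]  # average proportion of wrong predictions across the folds
```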
date()
## [1] "Tue Dec 6 04:21:16 2022"
The data set, titled “Boston”, used in this exercise contains 14 variables (columns) and 506 observations (rows). It entails information about housing values used to evaluate the willingness of people to pay for cleaner air in the Boston metropolitan area in the 1970s (source).
library(MASS)
data("Boston")
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506 14
Looking at the plots and the summary table, most variables are clearly not symmetrically distributed, e.g. crim, zn, and rad. Many of the variables also correlate strongly (both positively and negatively) with each other based on the plots. The weakest correlating variable seems to be chas, i.e. the Charles River dummy variable, which might be due to it being binary (0 or 1). Interestingly, indus, i.e. the proportion of non-retail business acres per town, and tax, i.e. the full-value property-tax rate per $10,000, are bimodally distributed.
library(GGally)
ggpairs(Boston, lower = list(
continuous = wrap(
"smooth",
alpha = 0.3,
size = 0.5,
color = "firebrick1"
)
)) + theme_bw()
library(tidyverse)
library(corrplot)
cor_matrix <- cor(Boston) %>% round(digits = 2)
corrplot(
cor_matrix,
method = "circle",
type = "lower",
cl.pos = "b",
tl.pos = "d",
tl.cex = 0.6
)
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
The original Boston dataset had variables measured in different ways and on different scales/magnitudes. After using scale(), every variable has been standardized to zero mean and unit variance, which allows for better comparison between them.
boston_scaled <- Boston %>% scale()
summary(boston_scaled)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
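Under the hood, scale() standardizes each column as (x − mean(x)) / sd(x); a quick sanity check against a manual computation for one variable, using the Boston data and the boston_scaled matrix created above:

```r
# Manual standardization of one column should match scale()'s output:
manual_crim <- (Boston$crim - mean(Boston$crim)) / sd(Boston$crim)
all.equal(as.numeric(boston_scaled[, "crim"]), manual_crim)  # TRUE
```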
boston_scaled_df <- as.data.frame(boston_scaled)
bins <- quantile(boston_scaled_df$crim)
crime <-
cut(
boston_scaled_df$crim,
breaks = bins,
include.lowest = TRUE
)
boston_scaled_df <- boston_scaled_df %>% dplyr::select(-crim)
boston_scaled_df <- data.frame(boston_scaled_df, crime)
levels(boston_scaled_df$crime) <- c("low", "med_low", "med_high", "high")
n <- nrow(boston_scaled_df)
set.seed(2126)
ind <- sample(n, size = n * 0.8)
train <- boston_scaled_df[ind,]
test <- boston_scaled_df[-ind,]
lda.fit <- lda(crime ~ ., data = train)
lda.arrows <-
function(x,
myscale = 1,
arrow_heads = 0.1,
color = "black",
tex = 0.75,
choices = c(1, 2)) {
heads <- coef(x)
arrows(
x0 = 0,
y0 = 0,
x1 = myscale * heads[, choices[1]],
y1 = myscale * heads[, choices[2]],
col = color,
length = arrow_heads
)
text(
myscale * heads[, choices],
labels = row.names(heads),
cex = tex,
col = color,
pos = 3
)
}
classes <- as.numeric(train$crime)
plot(lda.fit,
dimen = 2,
col = c("red", "blue", "purple", "gold")[classes])
lda.arrows(lda.fit, myscale = 2)
The confusion table below shows that the majority of the model’s predictions are correct overall (74 out of 102, 72.5%). In more detail, of the observations predicted as low crime rate, 16 out of 25 (64.0%) were correct, as were 16 out of 28 (57.1%) med_low, 12 out of 18 (66.7%) med_high, and 30 out of 31 (96.8%) high predictions. The model is thus highly accurate in predicting high crime rates but somewhat inaccurate with the lower classes, especially med_low.
correct_classes <- test$crime
new_test <- test %>% dplyr::select(-crime)
lda.pred <- predict(lda.fit, newdata = new_test)
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 16 5 1 0
## med_low 8 16 5 0
## med_high 1 7 12 1
## high 0 0 0 30
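The percentages above can be reproduced directly from the printed confusion matrix (rows = correct class, columns = predicted class); a self-contained sketch using the numbers from the table:

```r
# Confusion matrix copied from the printed table above:
conf <- matrix(c(16,  5,  1,  0,
                  8, 16,  5,  0,
                  1,  7, 12,  1,
                  0,  0,  0, 30),
               nrow = 4, byrow = TRUE,
               dimnames = list(correct   = c("low", "med_low", "med_high", "high"),
                               predicted = c("low", "med_low", "med_high", "high")))
sum(diag(conf)) / sum(conf)  # overall accuracy: 74/102, ca. 0.725
diag(conf) / colSums(conf)   # share of correct predictions per predicted class
```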
As it was not specified which distances we were supposed to compute, I computed both Euclidean and Manhattan distances. For the initial k-means analysis, I arbitrarily picked three clusters. After determining the optimal number of clusters with the total within-cluster sum of squares (TWCSS), I ended up with two, since the steepest drop in the line plot occurs between one and two clusters. The scaling of the Boston dataset seems to do strange things to the distribution of the rad variable; I’m not sure why. Anyway, looking at the new distribution with two k-means clusters, the indus, nox, and tax variables each have two differently colored peaks. This makes sense, as I previously noted that these are bimodally distributed, so I would assume the scaled data has two centroids.
data("Boston")
new_boston_scaled <- as.data.frame(scale(Boston))
dist_eu <- dist(new_boston_scaled)
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
dist_man <- dist(new_boston_scaled, method = "manhattan")
summary(dist_man)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2662 8.4832 12.6090 13.5488 17.7568 48.8618
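As a side note on why the Manhattan summaries are larger: the Manhattan distance sums the absolute coordinate differences, whereas the Euclidean distance takes the square root of the summed squares, so for any pair of points Manhattan is never smaller than Euclidean. A toy illustration (not the course data):

```r
# Toy example: the classic 3-4-5 right triangle.
p <- rbind(a = c(0, 0), b = c(3, 4))
d_eu  <- dist(p)                        # sqrt(3^2 + 4^2) = 5
d_man <- dist(p, method = "manhattan")  # |3| + |4|       = 7
c(euclidean = as.numeric(d_eu), manhattan = as.numeric(d_man))
```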
library(RColorBrewer)
k_means <- kmeans(new_boston_scaled, centers = 3)
ggpairs(
as.data.frame(new_boston_scaled),
aes(color = as.factor(k_means$cluster)),
lower = list(continuous = wrap(
"smooth",
alpha = 0.3,
size = 0.5
)),
upper = list(continuous = wrap("cor", size = 2))
) +
scale_color_brewer(palette = "Dark2") +
scale_fill_brewer(palette = "Dark2") +
theme_bw()
twcss <- sapply(1:10, function(k){kmeans(new_boston_scaled, k)$tot.withinss})
qplot(x = 1:10, y = twcss, geom = "line") +
scale_x_continuous(breaks = c(1:10)) +
labs(x = "clusters") +
theme_bw()
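The "steepest drop" rule used to pick the number of clusters can also be read off programmatically with diff(). A sketch with a hypothetical WCSS curve (illustrative numbers, not from the Boston run above):

```r
# Hypothetical TWCSS values for k = 1..10 (illustrative only).
twcss_demo <- c(7000, 3100, 2600, 2300, 2100, 1950, 1850, 1760, 1690, 1630)
drops   <- diff(twcss_demo)        # change in TWCSS when adding one more cluster
k_elbow <- which.min(drops) + 1    # the largest drop ends at this k
k_elbow                            # here: steepest drop is from 1 to 2 clusters
```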
new_k_means <- kmeans(new_boston_scaled, centers = 2)
ggpairs(
new_boston_scaled,
aes(color = as.factor(new_k_means$cluster)),
lower = list(continuous = wrap(
"smooth",
alpha = 0.3,
size = 0.5
)),
upper = list(continuous = wrap("cor", size = 2))
) +
scale_color_brewer(palette = "Dark2") +
scale_fill_brewer(palette = "Dark2") +
theme_bw()
date()
## [1] "Tue Dec 6 04:22:44 2022"
At a glance, each variable seems to follow a (positively or negatively) skewed normal distribution. Only Edu.Exp, i.e., expected years of education, is nearly symmetrically distributed. Certain variables, such as gross national income (GNI), maternal mortality ratio (Mat.Mor), and adolescent birth rate (Ado.Birth), have a wide range between their minimum and maximum values as well as their means. This suggests an uneven and highly skewed distribution. Curiously, the highest GNI (123,124), which is also quite an outlier, belongs to Qatar. I wonder whether this has anything to do with its foreign aid/investments, oil and gas exports, and the FIFA World Cup it is currently hosting.
The strongest positive and statistically significant correlation is between expected years of education and life expectancy at birth (Life.Exp; 0.789), and conversely, the strongest negative one is between maternal mortality ratio (Mat.Mor) and life expectancy at birth (−0.857). The former makes sense: if you have a high life expectancy, you most likely live in a country where you do not have to struggle to survive, and can therefore spend more time getting a degree. The latter is also sensible: if a mother dies in childbirth, the child's chances of survival decrease, especially in developing countries.
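The extreme off-diagonal correlations can also be located directly from the cor() matrix rather than by eye. A toy sketch with hypothetical data (not the human dataset):

```r
# Toy data: y tracks x closely, z runs opposite to x.
set.seed(1)
d  <- data.frame(x = 1:20, y = (1:20) + rnorm(20), z = -(1:20) + rnorm(20))
cm <- cor(d)
diag(cm) <- NA  # ignore self-correlations
max_pair <- which(cm == max(cm, na.rm = TRUE), arr.ind = TRUE)[1, ]
min_pair <- which(cm == min(cm, na.rm = TRUE), arr.ind = TRUE)[1, ]
rownames(cm)[max_pair]  # most positively correlated pair
rownames(cm)[min_pair]  # most negatively correlated pair
```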
library(tidyverse)
library(GGally)
human_data <-
read.csv("Data/new_human.csv",
header = T,
row.names = "X")
custom_scatter_plots <-
function(data, mapping, method = "lm", ...) {
ggplot(data = data, mapping = mapping) +
geom_point(colour = "dodgerblue",
alpha = 0.3,
size = 1) +
geom_smooth(method = method,
color = "blueviolet",
linewidth = 0.5,
...)
}
ggpairs(
human_data,
lower = list(continuous = wrap(custom_scatter_plots)),
upper = list(continuous = wrap("cor", size = 4, color = "black"))
) +
theme_bw() +
theme(
strip.text = element_text(size = 11, color = "black", face = "bold"),
axis.text.x = element_text(
angle = 45,
hjust = 1,
vjust = 1
)
)
summary(human_data)
## Edu2.FM Labo.FM Edu.Exp Life.Exp
## Min. :0.1717 Min. :0.1857 Min. : 5.40 Min. :49.00
## 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:11.25 1st Qu.:66.30
## Median :0.9375 Median :0.7535 Median :13.50 Median :74.20
## Mean :0.8529 Mean :0.7074 Mean :13.18 Mean :71.65
## 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:15.20 3rd Qu.:77.25
## Max. :1.4967 Max. :1.0380 Max. :20.20 Max. :83.50
## GNI Mat.Mor Ado.Birth Parli.F
## Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 12040 Median : 49.0 Median : 33.60 Median :19.30
## Mean : 17628 Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50
The results for standardized and non-standardized human data are quite different. The loadings (indicated with red arrows) in the PCA of the non-standardized data reveal that the scales of the variables differ greatly. Looking at the data, gross national income (GNI) and maternal mortality ratio (Mat.Mor) are measured in the thousands, while the other variables are either ratios or measured, at most, in the hundreds. As a result, these two dominate the analysis and create the boomerang-shaped distribution in the non-standardized plot. The PCA dimensions also look odd, as dimension 1 explains 99.99% of the variance in the data and dimension 2 only 0.01%. Standardizing amends this, as the variables then become comparable on the same scale: the loadings look more reasonable (dimension 1 explains 53.6% and dimension 2 16.24% of the variance) and the data points are distributed more circularly.
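The variance percentages quoted here are simply the squared standard deviations of the principal components scaled to sum to 100. A sketch on random toy data (not the human dataset):

```r
# Toy data: 20 observations of 5 random variables.
set.seed(42)
toy <- matrix(rnorm(100), nrow = 20, ncol = 5)
pca <- prcomp(scale(toy))
var_pct <- 100 * pca$sdev^2 / sum(pca$sdev^2)  # % of variance per component
round(var_pct, 2)  # same numbers as summary(pca)$importance[2, ] * 100
```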
The clustering of countries in the PCA of standardized human data is interesting. Western nations (in Europe and North America) as well as Australia and New Zealand cluster towards female-to-male labor force participation rate (Labo.FM), female percent representation in parliament (Parli.F), expected years of education (Edu.Exp), GNI, female-to-male ratio with secondary education (Edu2.FM), and life expectancy at birth (Life.Exp). This is reasonable as these nations are known for stability, safety, and fulfilling human rights.
Curiously, many of the Caribbean and Latin American countries cluster only toward Parli.F and Labo.FM. This suggests that these countries have many females in parliamentary positions and in the work force, but are neither the wealthiest (based on the GNI loading) nor the safest (based on e.g., Mat.Mor or Life.Exp). The high values in Parli.F and Labo.FM might be due to poor families sending all available/capable family members to work out of necessity.
The Sub-Saharan, about half of the East Asian, and most of the South Asian countries cluster mainly due to Mat.Mor and Ado.Birth. Considering that these are mostly poor developing countries with a less-than-great track record on human rights, it is not surprising that females in these regions give birth more often while young and die more often in childbirth, most likely due to the former.
Lastly, most of the countries in the Middle East and North Africa cluster towards Edu.Exp, GNI, Edu2.FM, and Life.Exp. High GNI values are sensible due to, for example, oil and gas exports, whilst for the others the reasons are unclear (at least to me at the moment of writing).
library(ggfortify)
library(ggthemes)
library(countrycode)
library(patchwork)
pca_human <- prcomp(human_data)
new_human_data <-
human_data %>% mutate(Region = countryname(row.names(human_data), destination = "region"))
pca1 <- autoplot(
pca_human,
data = new_human_data,
colour = "Region",
label = TRUE,
label.size = 4,
label.repel = TRUE,
loadings = TRUE,
loadings.label = TRUE,
loadings.label.size = 4,
loadings.label.repel = TRUE
) +
scale_color_manual(values = ptol_pal()(length(unique(
new_human_data$Region
))),
guide = guide_legend(override.aes = list(size = 3))) +
geom_hline(yintercept = 0,
linetype = "dashed",
alpha = 0.4) +
geom_vline(xintercept = 0,
linetype = "dashed",
alpha = 0.4) +
ggtitle("Non-standardized human data") +
theme_bw()
pca_human_scaled <- prcomp(scale(human_data))
pca2 <- autoplot(
pca_human_scaled,
data = new_human_data,
colour = "Region",
label = TRUE,
label.size = 4,
label.repel = TRUE,
loadings = TRUE,
loadings.label = TRUE,
loadings.label.size = 4,
loadings.label.repel = TRUE
) +
scale_color_manual(values = ptol_pal()(length(unique(
new_human_data$Region
))),
guide = guide_legend(override.aes = list(size = 3))) +
geom_hline(yintercept = 0,
linetype = "dashed",
alpha = 0.4) +
geom_vline(xintercept = 0,
linetype = "dashed",
alpha = 0.4) +
ggtitle("Standardized human data") +
theme_bw()
pca1 + pca2 + plot_layout(guides = "collect") &
theme(legend.text = element_text(color = "black", size = 10))
Principal component analysis (PCA) of standardized and non-standardized human data. In the left panel, gross national income (GNI) and maternal mortality ratio (Mat.Mor) pull the data in two directions: affluent oil-exporting countries, such as Qatar, the United Arab Emirates, and the United States, and banking countries, such as Luxembourg and Switzerland, cluster towards the former, whilst many of the Sub-Saharan African countries are pulled towards the latter. Dimension 1 explains 99.99% of the variance, and dimension 2 explains 0.01%. In the right panel, there are overlapping region clusters pulled in three directions based on the loadings: countries in Sub-Saharan Africa cluster towards Mat.Mor, adolescent birth rate (Ado.Birth), female percent representation in parliament (Parli.F), and female-to-male labour force participation rate (Labo.FM); several Latin American and Caribbean countries towards Labo.FM and Parli.F; European, North American, and Pacific countries towards Labo.FM, Parli.F, expected years of education (Edu.Exp), GNI, female-to-male ratio with secondary education (Edu2.FM), and life expectancy at birth (Life.Exp); and lastly, countries in South, East and Central Asia, the Middle East, and North Africa towards Edu.Exp, GNI, Edu2.FM, and Life.Exp. Dimensions 1 and 2 explain 53.6% and 16.24% of the variance, respectively.
I decided to run the MCA on six random variables: escape.exoticism, where, dinner, Tea, resto, and pub. Looking at the eigenvalues in the summary, Dim.1 and Dim.2 explain 19.761% and 14.687% of the overall variance, respectively. In the Categories section, the ctr column for Dim.1 shows five categories with the highest contributions (ctr > 12.0): chain store+tea shop, tea shop, green, resto, and pub.
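The "% of var." row in the summary is just each eigenvalue divided by their total. Recomputing it from the printed eigenvalues (which are rounded, so the match with the summary is approximate):

```r
# Eigenvalues as printed in the MCA summary (rounded to three decimals).
ev  <- c(0.263, 0.196, 0.182, 0.164, 0.153, 0.136, 0.123, 0.115)
pct <- 100 * ev / sum(ev)  # percentage of variance per dimension
round(pct, 2)              # close to the "% of var." row
round(cumsum(pct), 2)      # close to the "Cumulative % of var." row
```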
The plots show correlations between categories. For example, dinner and tea shop cluster near each other, suggesting a strong correlation. This would make sense, as people probably drink tea with dinner at tea shops.
# Load tea data.
tea_data <-
read.csv(
"https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/tea.csv",
stringsAsFactors = TRUE
)
# Show data structure.
str(tea_data)
## 'data.frame': 300 obs. of 36 variables:
## $ breakfast : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
## $ tea.time : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
## $ evening : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
## $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## $ dinner : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
## $ always : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
## $ home : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
## $ work : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
## $ tearoom : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
## $ resto : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
## $ pub : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ price : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
## $ age : int 39 45 47 23 48 21 37 36 40 37 ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ SPC : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
## $ Sport : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
## $ age_Q : Factor w/ 5 levels "+60","15-24",..: 4 5 5 2 5 2 4 4 4 4 ...
## $ frequency : Factor w/ 4 levels "+2/day","1 to 2/week",..: 3 3 1 3 1 3 4 2 1 1 ...
## $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
## $ spirituality : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
## $ healthy : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
## $ diuretic : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
## $ friendliness : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
## $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ feminine : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
## $ sophisticated : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
## $ slimming : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ exciting : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
## $ relaxing : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
## $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
# Plot data.
tea_data %>%
dplyr::select(-age) %>% # Removed age due to age_Q already existing as factor.
pivot_longer(cols = everything()) %>%
ggplot(aes(value, fill = name)) +
geom_bar() +
guides(fill = "none") +
theme_bw() +
theme(axis.text.x = element_text(
angle = 45,
hjust = 1,
size = 8
)) +
facet_wrap("name", scales = "free")
library(FactoMineR)
# Pick randomly six variables.
set.seed(2133)
interesting_vars <- sample(colnames(tea_data), 6)
new_tea_data <- tea_data %>% dplyr::select(all_of(interesting_vars))
# Run MCA.
mca_tea_data <- MCA(new_tea_data, graph = FALSE)
# summary(mca_tea_data, nbelements = Inf) # nbelements = Inf prints whole summary with its 300 individuals.
summary(mca_tea_data)
##
## Call:
## MCA(X = new_tea_data, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7
## Variance 0.263 0.196 0.182 0.164 0.153 0.136 0.123
## % of var. 19.761 14.687 13.629 12.290 11.484 10.230 9.258
## Cumulative % of var. 19.761 34.448 48.076 60.366 71.850 82.080 91.338
## Dim.8
## Variance 0.115
## % of var. 8.662
## Cumulative % of var. 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
## 1 | 0.260 0.085 0.078 | -0.298 0.151 0.102 | 0.709
## 2 | 0.133 0.022 0.020 | -0.609 0.632 0.411 | 0.247
## 3 | 0.123 0.019 0.005 | 0.670 0.764 0.147 | -0.241
## 4 | 0.454 0.260 0.076 | 0.166 0.047 0.010 | -0.954
## 5 | 0.059 0.004 0.007 | -0.524 0.468 0.563 | -0.425
## 6 | 0.580 0.425 0.127 | 0.477 0.388 0.086 | -0.493
## 7 | 0.185 0.043 0.076 | -0.213 0.077 0.100 | 0.036
## 8 | 0.260 0.085 0.078 | -0.298 0.151 0.102 | 0.709
## 9 | -0.133 0.022 0.021 | 0.153 0.040 0.028 | -0.030
## 10 | -0.058 0.004 0.003 | 0.067 0.008 0.004 | 0.642
## ctr cos2
## 1 0.922 0.578 |
## 2 0.112 0.068 |
## 3 0.107 0.019 |
## 4 1.671 0.339 |
## 5 0.332 0.371 |
## 6 0.445 0.091 |
## 7 0.002 0.003 |
## 8 0.922 0.578 |
## 9 0.002 0.001 |
## 10 0.756 0.330 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr
## escape-exoticism | -0.205 1.258 0.038 -3.360 | -0.436 7.646
## Not.escape-exoticism | 0.184 1.131 0.038 3.360 | 0.392 6.872
## chain store | 0.124 0.623 0.027 2.860 | -0.464 11.748
## chain store+tea shop | -0.855 12.027 0.257 -8.765 | 0.505 5.649
## tea shop | 1.429 12.924 0.227 8.239 | 1.658 23.410
## dinner | 1.130 5.656 0.096 5.362 | 1.704 17.301
## Not.dinner | -0.085 0.426 0.096 -5.362 | -0.128 1.302
## black | -0.055 0.048 0.001 -0.546 | -0.208 0.905
## Earl Grey | -0.284 3.278 0.145 -6.592 | 0.019 0.019
## green | 1.784 22.141 0.393 10.844 | 0.356 1.189
## cos2 v.test Dim.3 ctr cos2 v.test
## escape-exoticism 0.171 -7.142 | -0.622 16.795 0.348 -10.196 |
## Not.escape-exoticism 0.171 7.142 | 0.559 15.094 0.348 10.196 |
## chain store 0.383 -10.707 | 0.054 0.170 0.005 1.242 |
## chain store+tea shop 0.090 5.179 | -0.117 0.325 0.005 -1.197 |
## tea shop 0.306 9.559 | -0.041 0.015 0.000 -0.237 |
## dinner 0.219 8.084 | -1.258 10.166 0.119 -5.970 |
## Not.dinner 0.219 -8.084 | 0.095 0.765 0.119 5.970 |
## black 0.014 -2.055 | 1.228 34.109 0.494 12.149 |
## Earl Grey 0.001 0.434 | -0.492 14.290 0.437 -11.429 |
## green 0.016 2.167 | 0.125 0.157 0.002 0.758 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## escape.exoticism | 0.038 0.171 0.348 |
## where | 0.404 0.479 0.006 |
## dinner | 0.096 0.219 0.119 |
## Tea | 0.403 0.025 0.529 |
## resto | 0.384 0.051 0.080 |
## pub | 0.256 0.231 0.008 |
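To list the top contributors without scanning the printed summary by eye, the per-category contributions can be sorted; FactoMineR stores them in mca_tea_data$var$contrib. A toy sketch with a few of the Dim 1 and Dim 2 values from the summary above hard-coded:

```r
# A small slice of the contribution matrix (values copied from the summary).
contrib_demo <- rbind(
  "tea shop"             = c(12.924, 23.410),
  "chain store+tea shop" = c(12.027,  5.649),
  "green"                = c(22.141,  1.189),
  "Earl Grey"            = c( 3.278,  0.019)
)
colnames(contrib_demo) <- c("Dim 1", "Dim 2")
# Categories sorted by their contribution to dimension 1, largest first.
contrib_demo[order(contrib_demo[, "Dim 1"], decreasing = TRUE), ]
```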
par(mfrow = c(1, 2)) # Plot figures next to each other.
plot.MCA(
mca_tea_data,
invisible = "ind",
graph.type = "classic",
habillage = "quali",
title = "MCA factor map"
)
plot.MCA(
mca_tea_data,
invisible = "ind",
graph.type = "classic",
habillage = "quali",
selectMod = "contrib 5",
# Show the five categories that contribute the most (ctr > 12.0).
title = "MCA factor map with most contributing categories"
)
(More chapters to be added…)